arXiv:1502.04058vl [stat.AP] 13 Feb 2015 


Latent modeling of flow cytometry cell 

populations 

Jonas Wallin^, Kerstin Johnsson*^, and Magnus Fontes^’^ 

^Mathematical Sciences, Chalmers and University of Gothenburg 
^Centre for Mathematical Sciences, Lund University 
^Institut Pasteur, Paris 

February 16, 2015 


Abstract 

Flow cytometry is a widespread single-cell measurement technology 
with a multitude of clinical and research applications. Interpretation 
of flow cytometry data is hard; the instrumentation is delicate and can 
not render absolute measurements, hence samples can only be inter¬ 
preted in relation to each other while at the same time comparisons 
are confounded by inter-sample variation. Despite this, current au¬ 
tomated flow cytometry data analysis methods either treat samples 
individually or ignore the variation by for example pooling the data. 
In this article we introduce a Bayesian hierarchical model for studying 
latent relations between cell populations in flow cytometry samples, 
thereby systematizing inter-sample variation. The model is applied to 
a data set containing replicated flow cytometry measurements of sam¬ 
ples from healthy individuals, with informative priors capturing expert 
knowledge. It is shown that the technical variation in the inferred cell 
population sizes is small in comparison to the intrinsic biological varia¬ 
tion. The large size of flow cytometry data, where a single sample can 
contain measurements on hundreds of thousands of cells, necessitates 
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computationally efficient methods. To address this, we have imple¬ 
mented a parallel Markov Chain Monte Carlo scheme for sampling the 
posterior distribution. 

Keywords: Bayesian hierarchical models, flow cytometry, model-based clus¬ 
tering. MSC: Primary 62P10; secondary 62F15, 68U99 


1 Introduction 

In a flow cytometer a number of characteristics for each individual cell in 
a sample of ~ 10"^ to ~ 10® cells are quantifled as they pass through it in 
a fluid stream. The data that are obtained are most often summarized by 
grouping cells into cell populations; properties of these cell populations are 
used in many clinical applications—for example monitoring HIV infection 
and diagnosing blood cancers—and in many branches of medical research 
(Shapiro, 2005; Nolan and Yang, 2007). Deflning the cell populations based 
on the measured characteristics is in state-of-the-art analyses still done man¬ 
ually by trained operators looking at two-dimensional projections of the data. 
However, the importance of automated methods has risen along with an in¬ 
crease of the dimension of typical flow cytometry data sets due developments 
in flow cytometry technology (O’Neill et ah, 2013) and the emergence of 
studies with large numbers of flow cytometry samples (Chen et ah, 2015). 

Automatic cell population identification is hard since flow cytometry mea¬ 
surements are not absolute, while at the same time different samples cannot 
be directly compared due to technical variation—especially apparent when 
samples are analyzed at different laboratories (Welters et ah, 2012)—and in¬ 
trinsic biological variation. Despite this, research into automated population 
identification methods has focused on individual or pooled flow cytometry 
samples, sometimes attempting to align data at first through normalization 
procedures (Hahne et ah, 2010). We introduce a Bayesian hierarchical model 
with latent relations between flow cytometry samples, thereby allowing for a 
systematic study of variation of cell population characteristics in a collection 
of samples and additionally enabling the use of prior information about cell 
populations. 

Splitting the cell measurements in a sample into cell populations is essen¬ 
tially a clustering problem. In the context of flow cytometry data analysis 
clustering is called automatic gating, as opposed to the manual gating per¬ 
formed by operators. Model-based clustering using mixture models has been 
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the most used approach for automated gating (Lo et ah, 2008; Boedigheimer 
and Ferbas, 2008; Chan et ah, 2008; Pyne et ah, 2009; Hu et ah, 2013; Naim 
et ah, 2014). Mixture models are very well suited to describe flow cytometry 
data because they have a natural biological interpretation based on the cell 
populations. Examples of other approaches that have been used for auto¬ 
mated gating are grid based density clustering (Qian et ah, 2010), spectral 
clustering (Zare et ah, 2010), hierarchical clustering (Qiu et ah, 2011; Brug- 
gner et ah, 2014) and k-means clustering (Aghaeepour et ah, 2011; Ge and 
Sealfon, 2012). An evaluation of a wide range of automated gating methods 
was performed in the FlowCAP I challenge (Aghaeepour et ah, 2013). No 
method clearly outperformed the others. 

Apart from pooling data, two approaches for identifying cell populations 
jointly in a collection of flow cytometry samples have been put forward pre¬ 
viously. The first one is to cluster each sample separately and then match 
the resulting clusters (Pyne et ah, 2009; Azad et ah, 2013). The second one 
is a Dirichlet process Bayesian hierarchical model ignoring variation in loca¬ 
tion and shape between cell populations (Cron et ah, 2013). This limitation 
precludes analysis of any such latent variation. 

In our model, the cells in a sample are clustered using a multivariate 
Gaussian mixture model (GMM), where K components describe cell popu¬ 
lations and one component describes outliers. Outliers can arise for example 
from dead cells, non-specific binding of markers and doublets, i.e. pairs or 
groups of cells that pass through the flow cytometer at the same time. For 
each component not representing outliers its mean and covariance matrix is 
linked to a latent cluster which collects corresponding components across all 
samples. In practice this is done by assuming a normal prior for the means 
and an inverse Wishart prior for the covariance matrices of the components 
linked to a given latent cluster. The variation in location and shape between 
corresponding mixture components across samples is controlled by the pri¬ 
ors on parameters of the latent clusters. The location of component means 
and shape of components can also be restricted if there is prior information 
supporting this. The probabilistic assumptions of the model are formulated 
in Section 2.1. In an extension of the model we include the possibility to 
account for that not all populations are present in every sample, which is a 
frequent situation in flow cytometry data sets which is challenging for joint 
analyzes. 

Our model differs from previous Bayesian hierarchical models of mixtures 
(Lopes et ah, 2003; Muller et ah, 2004; Teh et ah, 2006; Salakhutdinov et ah, 
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2010) in that the mixture components forming the Gaussian mixture model 
for one sample are non-exchangeable; they represent different cell populations 
and thus have different priors. Furthermore, latent relations between the 
mean and covariance parameters of mixture components have previously only 
been studied in the case when the covariance matrices were assumed to be 
diagonal (Salakhutdinov et ah, 2010). 

Another challenge that has to be addressed when analyzing flow cytome¬ 
try data is that cell populations in flow cytometry data can be skew and/or 
have heavy tails and are then not fit well by a Gaussian component (Lo et ah, 
2008; Pyne et ah, 2009; Fruhwirth-Schnatter and Pyne, 2010). To handle this 
we use multiple components to describe such populations, an approach that 
have often been employed for flow cytometry data (Finak et ah, 2009; Chan 
et ah, 2008; Baudry et ah, 2010; Naim et ah, 2014) and has the further advan¬ 
tage that the number of cell populations can be automatically detected. We 
merge Gaussian components into super components with a procedure based 
on a systematic study of methods for merging mixture components (Hennig, 
2010); details are given in Section 2.2. 

In Section 3.1 we fit the model to a simulated data set to evaluate the 
ability of the sampling scheme to recover the model parameters. In Section 
3.2 we apply our model to a real world flow cytometry data set where we 
detect both known cell populations with prior information on population 
locations and cell populations without this prior information. We then apply 
Principal Component Analysis (PGA) to the inferred cell population sizes 
and find that biological variation between individuals can be separated from 
technical variation between replicates. 

2 Methods 

2.1 Model 

Let Yjj denote vector valued measurement number i in sample j. Here i e 
{!,..., Uj}, where rij is the number of cells in sample j, and j G (1,..., J}, 
where J is the number of samples. We let the dimension of the observations 
be denoted d. With K components describing cell populations the probability 
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density for cell measurement i of a flow cytometry sample j is modeled as 

K 

/(Y«) = E jk^ij 1 ^^jki ^jk) f^jOi ^jo); (1) 

k=l 

where N{Y ; //,, S) denotes the probability density function of the normal 
distribution with mean jj, and covariance matrix S evaluated at Y. The 
number K is usually chosen to be much larger than the expected number of 
cell populations since for non-Gaussian cell populations many components 
are needed to describe them. The last component represents outliers and its 
parameters = Hq and Sjo = Sq are identical across samples. The vector 
TTj = (vTjo,... ,'^jK} contains the mixing proportions, i.e. the proportion of 
cells described by the component. To connect the cell populations between 
samples we use a latent layer, assuming that for a given k each and Yjk 
is drawn from a normal and an inverse Wishart distribution respectively. 
Speciflcally, in our model, for A: = 1,..., it', 

where 0^, and are hyper-parameters describing latent cluster 

k. The main reason for using the normal and inverse Wishart distribu¬ 
tions is computational efliciency, since these are conjugate priors to the mean 
and the covariance respectively of the normal distribution. We call 9k and 
(r'A: — d — 1) the latent cluster mean and latent cluster covariance matrix 
respectively, since they are the a priori expected values of iXj/. and Yijk. 

For the hyper-parameters describing the latent clusters and the mixing 
proportions we use the following prior distributions: 

0fc|tfc,Sfc--Y(tfc,Sfc), TTj - T>(a), (3) 

-- IW{Qk,ng), Uk\Xk -- exp(-Afc), 

where W denotes the Wishart distribution and D denotes the Dirichlet distri¬ 
bution, which is conjugate prior to the multinomial distribution. For each z/j, 
we assign a exponential prior on the positive natural numbers. The complete 
structure of the model is displayed through a DAG in Fig. 1. 

The parameters and define the prior belief of the location of the 
latent means 6k, whereas the parameters Qj. and ne control the spread of 
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Figure 1: Directed acyclic graph describing the Bayesian hierarchical model. 
Square boxes indicate that the values are known. 


component means within a latent cluster and are hence important to control 
the variation between samples. A large along with a small forces the 
together; it makes large deviations between and unlikely. On 
the other hand, the parameters and control the latent covariance 
matrices and the variation between component covariance matrices. If 
is large each will be close to — d—1) for any k; note that a high 

makes high I'k more probable. 

Finally, to simplify sampling from the posterior distribution of the pa¬ 
rameters, we add an component assignment variable Xij G {0,1,..., iF} de¬ 
scribing which component Yjj is drawn from. To comply with (1), the a pri¬ 
ori uncertainty of component membership is modeled by Xij ~ Mult(7rj, 1), 
where Mult denotes the multinomial distribution. 

The resulting posterior distribution of all the parameters, denoted 0, and 
X given the data Y is given in the Supplemental material. Appendix A. In 
Appendix B we describe the Markov chain Monte Carlo (MCMC) sampling 
scheme used to generate posteriors for our model parameters. 
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2.1.1 Absent components 

In some flow cytometry data sets not all cell populations are present in all 
samples. In our model this corresponds to that iijk = 0 for some (j, fc). 
However, mixture component parameters for empty clusters will still affect 
the mixing of the MCMC for the parameters of the latent cluster. It can also 
happen that if a cluster is empty that the mixture component moves and split 
a neighboring cluster in two. To avoid this in such data sets we extend our 
model by introducing a variable Zj G {0,1}^ that says which components are 
active in sample j. This has the further advantage that when sampling from 
the posterior distribution of the model we get the probability for each cluster 
that it is present in a sample. We impose a prior on Zj which is proportional 
to exp(-c, Ef=i Zi)/(Ef=i Zj > 0) where / denotes the indicator function 
and Cs > 0. The prior makes the model prefer fewer activated clusters so that 
if there is a very small cluster the likelihood will be larger if it is inactivated. 

The changes to (l)-(3) required by this extension are straightforward but 
inference of the model becomes a bit more involved since removing compo¬ 
nents reduces the dimension of the model. To accommodate for this we have 
included a reversible jump step in our sampling algorithm. Details are given 
in the Supplemental Material, Appendix B. 

2.2 Merging latent clusters 

To determine the “correct” number of clusters in a data set directly from 
the data is an ill-defined problem, since what should be considered to be 
a separate cluster depends on the interpretation of the data. Nevertheless, 
there are many different criteria which can be used to guide the decision about 
the number of populations (Friihwirth-Schnatter, 2006; Hennig, 2010). We 
use overlap between components—measured by Bhattacharyya distance— 
and unimodality of the resulting super clusters—measured by Hartigan’s dip 
test (Hartigan and Hartigan, 1985)—to determine which latent clusters to 
merge and to indicate our confldence in the mergers. 

In an evaluation of criteria for merging Gaussian components, the Bhat¬ 
tacharyya distance performed well (Hennig, 2010). Bhattacharyya distance 
merges clusters according to a pattern-based cluster concept as opposed to a 
modality-based concept (Hennig, 2010), meaning that a small dense cluster 
inside a large sparse cluster will not be merged into the sparse cluster when 
their densities are disparate, even when the resulting super cluster would have 
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had a unimodal density. This makes sense for flow cytometry data since two 
such clusters could very well describe two different cell populations. 

The Bhattacharyya distance between Tii) and N{fX 2 ,'Si 2 ) is 

C^bhat = 1/8 • {Hi — H 2 Y (Mi ~ M 2 ) + 1/2 • log ^|S|/, (4) 

where S = (5]i + S 2)/2 (Fukunaga, 1990). In order to measure Bhat¬ 
tacharyya distance between mixtures of Gaussian distributions, which is nec¬ 
essary for deciding if super clusters should be merged with other clusters, we 
approximate each mixture with a Gaussian distribution. The means and the 
covariance matrices are estimated using a soft clustering of the data points 
inferred from the sampling of Xij, detailed in the Supplementary material, 
Appendix G. 

However, it is not obvious how to set a threshold for dbhat) since the ap¬ 
propriate threshold depends on the distribution of the data (Hennig, 2010), 
which is unknown. Because of this we use a low soft threshold di and a 
high hard threshold d 2 - Two clusters closer to each other than di are always 
merged, two clusters whose distance is between di and d 2 are only merged if 
they fulfill an additional criterion based on Hartigan’s dip test for unimodal¬ 
ity. 

Unimodality is an appealing heuristic for defining cell populations, and 
it has frequently been used for automated gating (Ghan et ah, 2008; Ge 
and Sealfon, 2012; Naim et ah, 2014). It has two main limitations. The 
first one, populations which should be separate based on that they have 
disparate densities, even though they are colocalized, can be bypassed by 
combining unimodality with a pattern-based merging criterion such as Bhat¬ 
tacharyya distance. The second one, that it is difficult to determine if a 
multi-dimensional empirical distribution is multimodal, is usually handled by 
considering one-dimensional projections (Hennig, 2010; Naim et ah, 2014). 
This is the approach we take here, using Hartigan’s dip test of unimodality 
for each of the projections onto the coordinate axis and for the projection 
onto Fisher’s discriminant coordinate. If for a proposed merger, any of these 
projections is found to be multimodal, the clusters are not merged. Fur¬ 
ther details of the merging procedure is given in the Supplemental material, 
Appendix C. 
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Figure 2: One and two dimensional histograms for one flow cytometry sam¬ 
ple containing 15,000 in (a), and histograms of 15,000 data points drawn 
uniformly from the pooled data in (b). 

3 Experiments 

3.1 Simulated data 

In other to test the performance of the proposed sampling scheme, we use it 
on a simulated data set. The dimension of the simulated data is, for visual 
reasons, set to three. We use four latent clusters and generate eighty artificial 
flow cytometry samples. Each sample has measurements of 15,000 cells. In 
order to test the ability to detect if a population is present in a sample or 
not, one of the latent clusters is present in only eight samples and one latent 
cluster is present in twentyfour samples. The cluster which is present in only 
eight samples represents a small cell population, containing 1% of the total 
number of cells. The parameters and the algorithm used for generating the 
data are given in the Supplementary material, Appendix D.l. 

In Fig. 2 we show univariate and bivariate histograms of all cell measure¬ 
ments pooled together, as well as the corresponding histograms of the data 
from a single sample where all four clusters are present. Note that the data 
when pooled together has a complicated density, as it is in fact a mixture of 
232 multivariate normal densities. 

The priors are chosen to be non-informative, the prior parameters as well 
as the initial values for the MCMC sampler are given in the Supplemental 
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Figure 3: One and two dimensional histograms of 15,000 posterior draws of 
Y for the flow cytometry sample displayed in Fig. 2 (a). In (b) we show 
15,000 posterior draws of Y drawn uniformly from all the flow cytometry 
samples, thus matching Fig. 2 (b). 


material. Appendix D.l. The posterior distribution is explored by generating 
samples of the parameters in 10^ iterations, after a burn-in period of 10^ 
iterations from the MCMC sampler. In each iteration we also generate a 
sample of Y, i.e. a sample from the posterior predictive. In Fig. 3 we mimic 
the plots in Fig. 2, with the samples coming from the posterior predictive 
distribution of Y. Fig. 4 displays dots at the posterior mean locations of the 
mixture component means whose posterior probability of being active 
is greater than 1%; the true locations of the active clusters are displayed as 
circles. The model is able to detect which clusters that are active and which 
are not, and to find the location of the component means. 

Finally in Fig. 5 and 6, the marginal posterior distributions of the latent 
cluster parameters 9^ and subtracted by their true values, are presented. 
In Fig. 5 the dot represents the difference between the median of posterior 
distribution and the true value of each Ok- The vertical lines represent the 
2.5% and 97.5% quantiles. Fig. 6 displays results for each latent covariance 
matrix ^k/{,i^k — 4) in an analogous way. From Fig. 5 and Fig. 6 we see that 
the true parameters of both the means and the covariances are all between 
the 2.5% and 97.5% quantiles of the posterior distribution. 
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Figure 4: The posterior mean of the clusters centers, (dots), and the true 
cluster centers (circles). 
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Figure 5: The difference between the true value of each entry in each 6^ and 
the approximated marginal posterior distribution generated by the MCMC 
sampler. The black dot represents the median and the vertical line goes 
between the 2.5% and 97.5% quantiles. The light gray horizontal line is the 
0 line. 
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Figure 6: The difference between the true value of each of the entries in 
— 4) and the approximated marginal posterior distribution generated 
by the MCMC sampler. The black dot shows the median, and the black 
vertical line goes between the 2.5% and 97.5% quantiles. The light gray 
horizontal line is the 0 line. 

3.2 Flow cytometry data 

We use our model to study lymphocyte cell populations among peripheral 
mononuclear blood cells (PMBC) from four healthy donors. The data has 
previously been used in the evaluation of a population matching algorithm 
(Azad et ah, 2013) and is available from the R package healthyFlowData. 
Lymphocytes are the leading actors in the adaptive immune response and 
also play a role in the innate immune response. It is well known that the 
sizes of the lymphocyte subpopulations vary between individuals (Jentsch- 
Ullrich et ah, 2005). 

The blood from each of the four donors was split into five parts and then 
analyzed, the data set has therefore twenty flow cytometry samples. The data 
in healthyFlowData has been preprocessed as detailed in the Supplemental 
material, Appendix E.l. As an additional preprocessing step we scaled it 
using the 1% and 99% percentiles go.oi and go .99 of the pooled data, with 
the same scaling for all samples, so that go.oi = 0 and go .99 = 1 for each 
marker for the pooled data. Univariate and bivariate histograms of one of 
the samples as well as all samples pooled together are shown in Fig. 7. 

The data set has measurements of the four markers CD4, CDS, CDS and 
CD19. If a cell population has a high expression of a certain marker, say CD4, 
it is said that it is positive for this marker, denoted CD4+. A population 
with low expression is called negative and if the expression is intermediate it 
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Figure 7: One and two dimensional histograms of flow cytometry data. 

(a) One of the samples from healthyFlowData, with 19,160 data points. 

(b) 19,160 data points drawn from the twenty samples in healthyFlowData 
pooled together. 

is called dim or bright. It is well known that using the four markers above, 
five types of lymphocytes can be distinguished (Murphy et ah, 2012). CD19 
is a pan B-cell marker, i.e. it is highly expressed on all B-cells, and CDS is 
a pan T-cell marker. The lymphocytes with low expression of CDS or CD19 
are natural killer (NK) cells. There are two major cell populations among 
the T cells, namely helper T cells and cytotoxic T cells. The helper T cells 
are CD4 positive and CDS negative whereas the cytotoxic T cells are CDS 
positive and CD4 negative. There is also a CD4-CD8- population comprising 
mainly so called 7(5 T cells (Gertner et ah, 2007). Table 1 summarizes these 
subpopulations; we want to follow them in the twenty samples, while at the 
same time accounting for unknown populations due to incomplete biological 
knowledge or technical artifacts. 

Often in Bayesian analysis of mixture models one tries to construct as 
non-informative priors as possible (Richardson and Green, 1997,Boeder and 
Wasserman, 1997). However, in our setup we have two kinds of prior infor¬ 
mation that we want to utilize. First, we have semi-quantitative information 
about marker expression in the previously known populations described in 
Table 1. Even though there is no reference value for marker expression of 
a positive population (Shapiro, 2005)—since we have positive and negative 
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Table 1: Lymphocyte subpopulations. 



GD4 

CDS 

CDS 

CD19 

B cells 



- 

+ 

Helper T cells 

+ 

- 

+ 

- 

Gytotoxic T cells 

- 

+ 

+ 

- 

CD4-GD8- T cells 

- 

- 

+ 

- 

NK cells 



- 

- 


populations present for each of the four markers in the data—we can set 
relative values for the prior parameters based on the scaling of the data. 
Second, since the sample components linked to one latent cluster are sup¬ 
posed to represent one and the same cell population, the spread of their 
means must be reasonable. That is, for any latent cluster, the sample com¬ 
ponent means linked to it should be closer to its own mean than to the means 
of latent clusters that do not represent the same population, i.e. that are not 
merged with it during post-processing. This second piece of information is 
used to set n^, and Q;^. It is harder to translate to precise values of 
the parameters, but it can easily be detected if the priors allow too much 
variation. The chosen parameters as well as the initialization procedure are 
described in the Supplementary material Appendix E.2. 

Apart from the five latent clusters with informative priors on their la¬ 
tent means we also have a number of additional latent components with 
non-informative priors on their means. These additional components can 
capture outliers and unknown populations, but they can also capture parts 
of the known populations, especially if the populations are non-Gaussian and 
hence cannot be well approximated by one Gaussian component. We merge 
components with sufficient overlap as described in Section 2.2 after inferring 
the model to obtain the populations in one piece. With a higher number of 
components the data set can be described more accurately, but the interpre¬ 
tation of each component gets less clear. We use in total K = 17 components 
since we found that this was sufficient to describe the data set well, in the 
sense that one- and two-dimensional marginal distributions are accurate. 

With these priors we get in most cases appropriate variation between 
components when rerunning the analysis below, however sometimes (two out 
of six cases) the variation in the location of one of the seventeen clusters is 
a bit too high. 
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We run 10^ burn-in iterations of our algorithm and then R = 10^ produc¬ 
tion iterations to get samples of ^ ~ 

1,..., i? from the posterior distribution. Convergence of the MCMC sampler 
is established using trace plots, displayed in the Supplementary material, 
Appendix E.3. We use the means of 0^*^^ and — d — 1) 

to get point estimates of sample component and latent cluster means and 
covariance matrices; the means of are used to get point estimates of the 
mixing proportions. Furthermore, in each iteration we draw one synthetic 
cell measurement from the models of selected flow cytometry samples and 
one from the model of the pooled data, in order to evaluate how well the 
model fits the data. 

Fig. 8 (a) shows univariate and bivariate histograms of the synthetic cell 
measurements for one flow cytometry sample. Fig. 3 and Fig. 4 in the Supple¬ 
mentary material display histograms of synthetic measurements of another 
sample as well as of the pooled data. From these results it is clear that the 
inferred model is accurate and captures the variation across samples, which 
a model of pooled data cannot do. 

Merging of latent components results in six cell populations, summary 
statistics of these are shown in Fig. 9. Two of the populations fit the NK cell 
expression pattern in Table 1; there are multiple possible explanations of this. 
One explanation is be that one of the population is monocytes, which have 
front and side scatter properties similar to lymphocytes (BD Biosciences, 
2000)—making it plausible that some of them have been included—and also 
are CD3-CD19-. Another explanation could be that there are two NK pop¬ 
ulations which are differentially activated. There are strong evidence that 
there are two separate populations, see for instance the histograms of their 
joint CD4 expression shown in Fig. 6 in the Supplementary material. 

The component representing outliers has only a very small proportion 
of the cells assigned to it, the median of djo across samples is 0.0003, the 
maximum is 0.0011. 

The variation between flow cytometry samples is systematized in the hi¬ 
erarchical model, results of this can be seen in Fig. 9 (a) and Fig. 8. For com¬ 
parison, we fit Gaussian mixture models using the expectation-maximization 
algorithm to each flow cytometry sample separately. In this case there are no 
clear correspondences among the mixture components between samples, as 
seen in Fig. 10. When the data set was studied previously with an algorithm 
matching populations found by separate analysis of the samples, this was 
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(a) 


(b) 


Figure 8: Bayesian hierarchical model of lymphocyte subsets of PBMC from 
healthy individuals, (a) One and two dimensional histograms of 19,160 syn¬ 
thetic data points generated from the inferred model of the flow cytometry 
sample depicted in Fig. 7 (a), (b) Component parameter representations of 
inferred latent clusters and mixture components across the twenty flow cy¬ 
tometry samples. The center of each ellipse is the mean and each semi-axis 
is an eigenvector with length given by the corresponding eigenvalue of the 
projected covariance matrix. In the left column parameters of latent clusters 
k) are shown, in the right column parameters for the mixture 
components in each sample ’^jk) are shown. Each component or cluster 
is depicted with the same color as in Fig. 9; different shades of same color 
corresponds to latent clusters that have been merged. 
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Figure 9: Summary statistics of the six inferred cell populations, ordered by 
population size, (a) Component mean locations for each population. Differ¬ 
ent shades of a color in the same plot corresponds to different latent com¬ 
ponents form the same super component. Populations: CD4 T-cells (red), 
CDS T-cells (yellow), B-cells (green), monocytes or NK cells (turquoise), 
CD4-CD8- T-cells (blue), and monocytes or NK cells (purple), (b) Box- 
plots of the soft clusters. The boxes go between qkm, 0.25 and qtm, 0.75 and the 
whiskers extend to qkm, 0.01 and qkm, 0 . 99 - The «-quantile for (merged) com¬ 
ponent k in dimension m, qkm,a^ is here defined as qkm,a = minj/j/lliq/^ : 
^ (c) Population proportions for each of the flow 

cytometry samples. 
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Figure 10: Component parameter representations of inferred mixture com¬ 
ponents in independent Gaussian mixture models of three flow cytometry 
samples. The two samples depicted in the two right columns are techni¬ 
cal replicates. Note that there is no correspondence between colors between 
columns. 

only done with a coarse partition of the cell measurements, with four cell 
populations (Azad et ah, 2013). 

In downstream analysis of flow cytometry data the locations and shapes 
of the cell populations as well as their sizes can be used. For example, one cell 
population might have varying expression of one marker in different samples; 
this can be studied using the mean parameters ^6^^. We use cell population 
sizes to show how technical variation between replicates can be separated 
from biological variation between donors. 

We estimate the size of each cell population in each sample by sum of the 
mixing proportions for the sample components representing that population. 
We then do a principal component analysis on the population sizes. Almost 
all of the variance, 99.3%, is captured in the first two principal components. 
A biplot onto the two first principal components is shown in Fig. 11. We 
see that the biological variation between donors is much larger than the 
technical variation between replicates, samples from different donors are well 
separated. 
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Figure 11: PC A biplot. Colors for cell populations are the same as in Fig. 
9. Samples from the same donor are plotted with the same marker. Samples 
from different donors are well separated. 

4 Discussion 

In this paper we have presented a new Bayesian hierarchical model designed 
for joint population identification in many flow cytometry samples. The 
model captures the variability in shapes and locations of the populations be¬ 
tween the samples and has the possibility to include expert knowledge. We 
showed that for a synthetic data set generated from the model, the parame¬ 
ters were recovered with high accuracy through a MCMC sampling scheme. 
The sampling scheme was then applied to a real flow cytometry data set which 
contained five populations whose expression patterns were well-known to ex¬ 
perts. After merging latent clusters using criteria based on Bhattacharyya 
distance and Hartigan’s dip test, six populations were obtained—apart from 
the previously known populations, one additional population which could 
either be monocytes or NK cells. 

How much clusters should be merged is however a decision that needs to 
be taken by the interpreter of the data. The criteria we have used should 
be taken as guidelines. Other merging criteria, for example directly esti¬ 
mated misclassiflcation probabilities (Hennig, 2010), and diagnostic plots 
which evaluate the separation of clusters could also be utilized to guide such 
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decisions. As an example, in the analysis of the data set studied in this arti¬ 
cle, a flow cytometry data analyst might want to consider the small cluster 
which is CD8+ and has intermediate CD4 expression to be separate from the 
CDS T-cells. 

For very large flow cytometry data sets the running time can be pro¬ 
hibitive and downsampling is then required. A direction for future research 
is to And ways to downsample so that also small cell populations can be 
resolved, as has been done for the analysis of single flow cytometry samples 
(Naim et ah, 2014). 

The priors that we used in the real data experiments were rudimentary; 
incorporating more detailed knowledge about cell populations could speed 
up convergence. With smart ways of formulating priors based on popula¬ 
tions found in other flow cytometry data sets—possibly with different sets of 
markers than the data set under study—our method could lead the way to 
incremental learning of flow cytometry populations where well-known popu¬ 
lations will be easily characterized. 

There are interesting approaches to analysis of multiple flow cytometry 
data samples in parallel which are not based on clustering (Aghaeepour et ah, 
2013). But the power of clustering approaches is to provide a comprehensive 
description of all cell populations in the samples which can reveal subtle sig¬ 
nals and systematic perturbations. Finding good clustering algorithms which 
can serve as reliable automatic gating methods is therefore an important task. 
We feel strongly that analyzing many samples jointly, and using expert prior 
knowledge when doing automated gating is an important requirement to get 
the high performance which is needed. 

5 Software 

Our implementation of the MCMC sampling is available as a Python package 
at https: //github. com/ JonasWallin/BayesFlow. Its parallel implementa¬ 
tion is based on OpenMPI through the Python package mpi4py. 

6 Supplementary material 

Supplementary material is available. It contains the posterior, the MCMC 
sampling scheme, additional details on the merging of components, infor- 
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mation about the data generation, priors and initialization for the synthetic 
data example; additional details on the flow cytometry data set, the priors 
and the initialization procedure used when studying this data set and further 
results pertaining to the flow cytometry data set. 
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